Dense pose estimation is a dense 3D prediction task for instance-level human analysis that aims to map human pixels in an RGB image to a 3D surface of the human body. Because it regresses a large number of surface points, training appears more prone to collapse than in other region-based human instance analysis tasks. By analyzing the loss formulation of existing dense pose estimation models, we introduce a novel point regression loss function, named Dense Points loss, to stabilize training, and a new balanced loss weighting strategy to handle the multi-task losses. Building on these contributions, we propose a new architecture, named UV R-CNN. Without auxiliary supervision or external knowledge from other tasks, UV R-CNN can handle many complicated issues in dense pose model training, achieving 65.0% $AP_{gps}$ and 66.1% $AP_{gpsm}$ on the DensePose-COCO validation subset with a ResNet-50-FPN feature extractor, which is competitive with state-of-the-art dense human pose estimation methods.
translated by 谷歌翻译
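Human matting refers to extracting human parts from natural images with high quality, including fine human details such as hair, glasses, and hats. This technique plays an essential role in image synthesis and visual effects in the film industry. When a green screen is not available, existing human matting methods need additional inputs (such as a trimap or a background image) or models with high computational cost and complex network structure, which makes human matting difficult to apply in practice. To alleviate such problems, most existing methods (such as MODNet) use multiple branches to pave the way for matting through segmentation, but these methods do not make full use of the image features and only use the prediction results of the network as guidance information. Therefore, we propose a module that generates a foreground probability map and add it to MODNet to obtain the Semantic Guided Matting Net (SGM-Net). With only a single image as input, we can carry out the human matting task. We validate our method on the P3M-10k dataset. Compared with the baseline, our method improves significantly on various evaluation metrics.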
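In air-to-ground massive multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) systems, designing spectrum-efficient broadband multi-user hybrid beamforming with limited pilot and feedback overhead is challenging. To this end, by modeling the key transmission modules as an end-to-end (E2E) neural network, this paper proposes a data-driven deep learning (DL)-based unified hybrid beamforming framework for both time division duplex (TDD) and frequency division duplex (FDD) systems with implicit channel state information (CSI). For TDD systems, the proposed DL-based approach jointly models the uplink pilot combining and downlink hybrid beamforming modules as an E2E neural network. For FDD systems, we jointly model the downlink pilot transmission, uplink CSI feedback, and downlink hybrid beamforming modules as an E2E neural network. Unlike conventional approaches that process the modules separately, the proposed solution optimizes all modules simultaneously, with the sum rate as the optimization objective. Therefore, by perceiving the inherent properties of air-to-ground massive MIMO-OFDM channel samples, the DL-based E2E neural network can establish a mapping function from the channel to the beamformer, so that explicit channel reconstruction can be avoided and pilot and feedback overhead reduced. In addition, practical low-resolution phase shifters (PSs) introduce a quantization constraint, which makes gradient backpropagation intractable when training the neural network. To mitigate the performance loss caused by the phase quantization error, we adopt a transfer learning strategy that further fine-tunes the E2E neural network starting from a pre-trained network that assumes ideal infinite-resolution PSs. Numerical results show that our DL-based schemes have considerable advantages over state-of-the-art schemes.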
Function approximation (FA) has been a critical component in solving large zero-sum games. Yet little attention has been given to FA in solving \textit{general-sum} extensive-form games, even though they are widely regarded as computationally more challenging than their fully competitive or cooperative counterparts. A key challenge is that for many equilibria in general-sum games, no simple analogue to the state value function used in Markov Decision Processes and zero-sum games exists. In this paper, we propose learning the \textit{Enforceable Payoff Frontier} (EPF) -- a generalization of the state value function for general-sum games. We approximate the optimal \textit{Stackelberg extensive-form correlated equilibrium} by representing EPFs with neural networks and training them using appropriate backup operations and loss functions. This is the first method that applies FA to the Stackelberg setting, allowing us to scale to much larger games while still enjoying performance guarantees based on FA error. Additionally, our proposed method guarantees incentive compatibility and is easy to evaluate without having to depend on self-play or approximate best-response oracles.
Correlated Equilibrium is a solution concept that is more general than Nash Equilibrium (NE) and can lead to outcomes with better social welfare. However, its natural extension to the sequential setting, the \textit{Extensive Form Correlated Equilibrium} (EFCE), requires a quadratic amount of space to solve, even in restricted settings without randomness in nature. To alleviate these concerns, we apply \textit{subgame resolving}, a technique that has been extremely successful in finding NE in zero-sum games, to solving general-sum EFCEs. Subgame resolving refines a correlation plan in an \textit{online} manner: instead of solving for the full game upfront, it only solves for strategies in subgames that are reached in actual play, resulting in significant computational gains. In this paper, we (i) lay out the foundations to quantify the quality of a refined strategy, in terms of the \textit{social welfare} and \textit{exploitability} of correlation plans, (ii) show that EFCEs possess a sufficient amount of independence between subgames to perform resolving efficiently, and (iii) provide two algorithms for resolving, one using linear programming and the other based on regret minimization. Both methods guarantee \textit{safety}, i.e., they will never be counterproductive. This is the first time an online method has been applied to the correlated, general-sum setting.
In this paper, we study the \underline{R}obust \underline{o}ptimization for \underline{se}quence \underline{Net}worked \underline{s}ubmodular maximization (RoseNets) problem, which interweaves robust optimization with sequence networked submodular maximization. The elements are connected by a directed acyclic graph, and the objective function is submodular not on the elements but on the edges of the graph. In such a networked submodular scenario, the impact of removing an element from a sequence depends both on its position in the sequence and on its position in the network. This makes existing robust algorithms inapplicable. We take the first step toward studying the RoseNets problem. We design a robust greedy algorithm that is robust against the removal of an arbitrary subset of the selected elements. The approximation ratio of the algorithm depends both on the number of removed elements and on the network topology. We further conduct experiments on real applications of recommendation and link prediction. The experimental results demonstrate the effectiveness of the proposed algorithm.
Learning with noisy labels (LNL) is a classic problem that has been extensively studied for image tasks, but much less for video. A straightforward migration from images to videos that ignores the properties of videos, such as computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method, dubbed Channel Truncation, for feature-based label noise detection, which selects the most discriminative channels to split clean and noisy instances in each category; 2) a novel contrastive strategy, dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed tru{\bf N}cat{\bf E}-split-contr{\bf A}s{\bf T} (NEAT) significantly outperforms the existing baselines. By reducing the dimension to 10\% of the original, our method achieves a noise detection F1-score of over 0.4 and a 5\% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80\%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6\%.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practice or about the bottlenecks the community faces in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Conditional normalizing flows can generate diverse image samples for solving inverse problems. Most normalizing flows for inverse problems in imaging employ the conditional affine coupling layer, which can generate diverse images quickly. However, unintended severe artifacts are occasionally observed in their output. In this work, we address this critical issue by investigating the origins of these artifacts and proposing conditions to avoid them. First, we empirically and theoretically reveal that these problems are caused by ``exploding variance'' in the conditional affine coupling layer for certain out-of-distribution (OOD) conditional inputs. Then, we further validate that the probability of erroneous artifacts in pixels is highly correlated with a Mahalanobis distance-based OOD score for inverse problems in imaging. Lastly, based on our investigations, we propose a remark on avoiding exploding variance and, based on it, suggest a simple remedy that substitutes the affine coupling layers with modified rational quadratic spline coupling layers in normalizing flows, to improve the robustness of generated image samples. Our experimental results demonstrate that the suggested methods effectively suppress critical artifacts occurring in normalizing flows for super-resolution space generation and low-light image enhancement without compromising performance.
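As a rough illustration of what a Mahalanobis distance-based OOD score can look like, the sketch below fits a single Gaussian to in-distribution feature vectors and scores a new input by its Mahalanobis distance to that Gaussian. It is only a sketch under stated assumptions: the abstract does not say which features are used or how the score is computed, so the function names `fit_gaussian` and `mahalanobis_score`, the single-Gaussian model, and the `ridge` regularizer are all hypothetical choices made here for illustration.

```python
import numpy as np


def fit_gaussian(features, ridge=1e-6):
    """Estimate the mean and precision matrix of in-distribution features.

    features: array of shape (num_samples, dim).
    ridge: small diagonal term (an assumption of this sketch) that keeps
           the covariance invertible.
    """
    mu = features.mean(axis=0)
    centered = features - mu
    cov = centered.T @ centered / (features.shape[0] - 1)
    cov = cov + ridge * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)


def mahalanobis_score(x, mu, precision):
    """Mahalanobis distance of one feature vector x to the fitted Gaussian.

    Larger values mean x lies further from the in-distribution data.
    """
    diff = x - mu
    return float(np.sqrt(diff @ precision @ diff))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(500, 3))  # stand-in for in-distribution features
    mu, precision = fit_gaussian(train)
    print("near:", mahalanobis_score(mu + 0.1, mu, precision))
    print("far: ", mahalanobis_score(mu + 10.0, mu, precision))
```

In this toy setup, a conditional input whose features sit far from the training statistics receives a larger score, which is the kind of quantity the abstract reports as correlating with artifact probability.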
Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or of Convolutional Neural Networks (CNNs) in computer vision. However, studies of the best way to aggregate the patch tokens are still limited to average pooling, although widely used pooling strategies such as max and GeM pooling could also be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT or the channel-wise differences in the activation maps, and so aggregate crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM yields performance boosts of 0.1%p to 0.7%p over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating the superiority of GGeM for a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.
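The description of GGeM (split the channels into groups, then apply GeM pooling with one shared exponent per group) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `ggem_pool`, the contiguous channel grouping, and the clamping of activations to a small positive `eps` (borrowed from common GeM implementations) are assumptions, and in practice the exponents would be learnable parameters rather than fixed inputs.

```python
import numpy as np


def ggem_pool(tokens, p, eps=1e-6):
    """Group generalized-mean pooling over patch tokens (illustrative sketch).

    tokens: array of shape (num_tokens, channels).
    p: array of shape (groups,), one pooling exponent shared by every
       channel in the same group; channels must be divisible by groups.
    eps: lower clamp for activations (an assumption of this sketch), so
         that fractional powers are well defined.
    Returns an array of shape (channels,).
    """
    num_tokens, channels = tokens.shape
    groups = p.shape[0]
    if channels % groups != 0:
        raise ValueError("channels must be divisible by the number of groups")

    # Clamp, then view the channels as (groups, channels_per_group).
    x = np.maximum(tokens, eps).reshape(num_tokens, groups, channels // groups)

    # Generalized mean over the token axis, using each group's own exponent.
    powered = x ** p.reshape(1, groups, 1)
    pooled = powered.mean(axis=0) ** (1.0 / p.reshape(groups, 1))
    return pooled.reshape(channels)


if __name__ == "__main__":
    tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                       [3.0, 2.0, 1.0, 4.0]])
    # Exponent 1 reduces to average pooling; larger exponents move toward max.
    print(ggem_pool(tokens, np.array([1.0, 1.0])))
    print(ggem_pool(tokens, np.array([1.0, 50.0])))
```

With every exponent set to 1 the result equals average pooling of the clamped activations, and as an exponent grows the corresponding group moves toward max pooling, which is how a per-group exponent lets different channel groups be aggregated differently.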